Mekhanik evgenii/1196 #17
Conversation
include/net/tcp.h (outdated)

```c
if (tcp_send_head(sk)) {
	struct tcp_sock *tp = tcp_sk(sk);

#ifdef CONFIG_SECURITY_TEMPESTA
	if (mss_now != 0)
		__tcp_push_pending_frames(sk, mss_now, tp->nonagle);
```
Can we initialize `mss_now` using `tcp_current_mss()` and get rid of `mss_now != 0`?
I thought about it, but we don't need to calculate `mss_now` if `SOCK_TEMPESTA_HAS_DATA` is not set and `tcp_send_head()` is empty. I don't want to make extra calculations.
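A minimal sketch of the lazy calculation being discussed; wrapping it in a standalone helper and the `sock_flag()` accessor for `SOCK_TEMPESTA_HAS_DATA` are my assumptions, not the PR's code:

```c
#include <net/tcp.h>

/* Sketch: compute mss_now only when there is actually something to
 * push, either in the kernel write queue or in Tempesta's private
 * queue, so tcp_current_mss() isn't wasted work on the idle path.
 */
static void tempesta_push_pending(struct sock *sk)
{
	struct tcp_sock *tp = tcp_sk(sk);
	unsigned int mss_now = 0;

	if (tcp_send_head(sk) || sock_flag(sk, SOCK_TEMPESTA_HAS_DATA))
		mss_now = tcp_current_mss(sk);
	if (mss_now != 0)
		__tcp_push_pending_frames(sk, mss_now, tp->nonagle);
}
```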
LGTM
LGTM
```c
struct tcp_sock *tp = tcp_sk(sk);
unsigned int in_flight = tcp_packets_in_flight(tp);
unsigned int send_win, cong_win;
unsigned int limit;
int result;

if (!sk->sk_write_xmit || !skb_tfw_tls_type(skb))
```
Suggested change:

```c
if (unlikely(!sk->sk_write_xmit || !skb_tfw_tls_type(skb)))
```
From my point of view this patch has some points to improve.

Let's start with a discussion of the goals. As I understand it, our goal is to keep as little data as possible in `write_queue`. For instance, if the client is able to receive 1024 bytes, we must place an skb of 1024 bytes in `write_queue` and send it in one shot, avoiding any segmentation. In other words, `tso_fragment` must not be called at all.

However, this patch works a little differently. The first problem is described in the comment marked (2). The second problem is that `ss_calc_snd_wnd()` calculates the send window in a wrong way: it completely ignores the TCP segmentation limits (see `tcp_tso_segs()`) and calculates the cwnd allowance without taking the minimum of half the cwnd and the in-flight bytes (see `tcp_cwnd_test()`). This second problem leads to segmentation when the interface has a poor count of TSO segments, or when the congestion control module limits the number of TSO segs, so we place in the queue a big skb whose size is bigger than `max_segs`.
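For reference, the kernel-side check being pointed to looks roughly like this (modeled on `tcp_cwnd_test()`; the helper name is mine, and newer kernels read the cwnd via `tcp_snd_cwnd()`):

```c
#include <net/tcp.h>

/* Modeled on tcp_cwnd_test(): the per-burst allowance is
 * min(half the cwnd, cwnd - in_flight), the min that the comment
 * above says ss_calc_snd_wnd() skips.
 */
static unsigned int cwnd_allowance(const struct tcp_sock *tp)
{
	u32 in_flight = tcp_packets_in_flight(tp);
	u32 cwnd = tp->snd_cwnd;
	u32 halfcwnd;

	if (in_flight >= cwnd)
		return 0;	/* congestion window is already full */

	/* Never burst more than half the congestion window at once. */
	halfcwnd = max(cwnd >> 1, 1U);
	return min(halfcwnd, cwnd - in_flight);
}
```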
Just an assumption: maybe we can cache the `limit` from `ss_calc_snd_wnd()` and use it in `tcp_write_xmit`, but it's only a theory.
```c
struct tcp_sock *tp = tcp_sk(sk);
unsigned int in_flight = tcp_packets_in_flight(tp);
unsigned int send_win, cong_win;
unsigned int limit;
```
(1) Why do we recalculate `limit` here? We already have the `limit` calculated in `tcp_write_xmit`; it is calculated correctly and we can use it.
Because we want to make one TLS record for several skbs. For example, if the MTU is 1500, our previously calculated limit is also about 1500; but if there are several skbs in the socket write queue, we can encrypt them all into just one TLS record, and not call `tls_encrypt` several times.
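A rough sketch of the coalescing idea; the helper is hypothetical, only `sk_write_queue` and `skb_queue_walk()` are standard kernel APIs:

```c
#include <net/tcp.h>

/* Sketch: sum queued skbs up to the computed window limit so that a
 * single TLS record (one tls_encrypt call) covers all of them.
 */
static unsigned int tls_record_len(struct sock *sk, unsigned int limit)
{
	struct sk_buff *skb;
	unsigned int len = 0;

	skb_queue_walk(&sk->sk_write_queue, skb) {
		if (len + skb->len > limit)
			break;
		len += skb->len;
	}
	return len;	/* bytes to encrypt as one record */
}
```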
```c
		__func__, mss_now);
	break;
}
TFW_ADJUST_TLS_OVERHEAD(limit);
```
(2) This looks suboptimal. We calculate a `limit`, then in `TFW_ADJUST_TLS_OVERHEAD` we reduce the limit by the TLS overhead; the limit becomes lower than `skb->len`, so `tso_fragment` will split this skb, which is unnecessary overhead. Afterwards, in `tfw_tls_encrypt`, we will encrypt both skbs, but the last one will hang in the `write_queue` anyway.
Yes; because we encrypt several skbs, we don't need to adjust for the overhead.
```c
struct {
	__u8 present : 1;
	__u8 tls_type : 7;
	__u16 flags : 16;
```
We still need `flags` to notify when an skb is transmitted, e.g. tempesta-tech/tempesta#2108. Of course, `on_tcp_entail` can do this as well, but it is not currently used for this purpose, at least not for all control frames.
We should implement an additional on_send_ function in Tempesta for 2108, not use flags in the skb. Moreover, we removed the additional callback from the kernel code, so 2108 should be reworked.
No, it can't be done in the `on_send` callback: that only indicates entry into the socket's write queue, but the counter decrement must happen at the time of actual transmission, because only at that point is the skb handed to the TCP transmission process, i.e. `tcp_write_xmit` or its callers, e.g. `__tcp_push_pending_frames`, and not accumulated in kernel memory for some reason, such as congestion or a zero TCP window from a malicious peer. That is exactly the purpose of this counter: to avoid control-frame flooding.

So, for 2108, the solution is either to keep the counter decrement in `sk_write_xmit`, in which case the `tfw_cb.flag` is still needed, or to manipulate the counter directly in `ss_skb_on_tcp_entail`. No additional callbacks are needed here.
`ss_skb_on_tcp_entail` is called before pushing the skb to the socket write queue. It is a good place to decrement the counter.
Applications which work in user space and count control frames know nothing about `__tcp_push_pending_frames`, and there is no problem.
I think we can postpone such frames if it is really necessary (and send them only when we have enough window) and count them in `ss_skb_on_tcp_entail`. I think that we will move tls_encrypt to Tempesta code, and there is no sense in having one more callback in the Linux kernel only to count control frames. User-space applications like nginx don't know anything about `tcp_transmit_skb`; how can they count control frames?
> User-space applications like nginx don't know anything about tcp_transmit_skb. How can they count control frames?

@EvgeniiMekhanik @krizhanovsky @const-t
I have to update some thoughts.

No, the user program relies on the sockbuf to determine whether the skb send is in flight. If the sockbuf is full, the `send()` system call will fail because the peer does not provide enough window; or, for a non-blocking fd, the write event is shelved. The counter will then accumulate, and the user program will know that it is under a flood attack. (Moreover, the user program can use the TCP_INFO getsockopt to retrieve TCP information and learn the in-flight bytes. In fact, nginx uses it to define some variables, although not exactly for our purpose.)

However, in our case it's a completely different story. Tempesta does not use the sockbuf, but directly enqueues to the sk's sk_write_queue, so it will have memory risks unless we know that the skb has started to be transmitted, i.e. is in flight. How do we know this fact? The only place is where `tcp_transmit_skb` successfully handles the skb. That is why I suggest that we apply the counting logic there. Neither on_send (called in ss_tx_action) nor entail (corresponding to the original prepare_xmit), nor even the write_xmit callback, can meet our need to mitigate control-frame flood attacks. This is not a theory, but has been verified through testing.

And the method suggested by @EvgeniiMekhanik does not work either, because, similarly to entail, the fact that skbs are pushed when the window is available does not mean that all of them can be sent immediately: it does not know how large the network capacity is or how many skbs can be sent, so it's likely that some skbs will be held in memory for an undetermined time.
We call `ss_skb_on_tcp_entail` just before we push an skb to the socket write queue. For HEADERS and DATA frames we calculate the TCP window in `sk_fill_write_queue` and don't call `ss_skb_tcp_entail`->`ss_skb_on_tcp_entail` if we don't have enough window. To solve the problem with control frames, we should likewise not call `ss_skb_tcp_entail`->`ss_skb_on_tcp_entail` for all other types of frames when there is not enough TCP window; we should push them to a queue of postponed frames and move them to the socket write queue only when we have enough TCP window. In this case we can count control frames in `ss_skb_on_tcp_entail`.
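Schematically, the proposal might look like the following; the postponed queue, the helper, and the `ss_skb_on_tcp_entail()` signature are all assumptions here:

```c
#include <net/tcp.h>

void ss_skb_on_tcp_entail(struct sock *sk, struct sk_buff *skb); /* assumed signature */

static struct sk_buff_head postponed_frames;	/* hypothetical; initialized elsewhere */

/* Sketch: entail (and count) a control frame only while TCP window
 * remains; otherwise defer it until the window reopens.
 */
static void push_ctrl_frame(struct sock *sk, struct sk_buff *skb,
			    unsigned int wnd_left)
{
	if (skb->len > wnd_left) {
		/* Not enough window: postpone instead of entailing. */
		skb_queue_tail(&postponed_frames, skb);
		return;
	}
	ss_skb_on_tcp_entail(sk, skb);	/* control-frame counter decremented here */
	tcp_add_write_queue_tail(sk, skb);
}
```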
Fair enough. This is sufficient if we do not care about strict accuracy as to whether the skb was successfully sent by `tcp_transmit_skb` or was even removed from `tcp_rtx_queue`.
I agree that the `flags` are replaced with the `on_send` callback in #1973, so it's not desirable to keep the flags, especially with only one user. Meantime, I also don't like the idea of additional queueing, since it wastes memory, decreasing performance, and opens opportunities for DoS attacks.

However, in the first place I didn't get what we fix with #2108, so I requested changes. Maybe I'm wrong; let's discuss the fix, but I proposed to control and limit flows to the client in tempesta-tech/tempesta#498 (comment).

P.S. @kingluo @EvgeniiMekhanik thank you for inviting me to the discussion
```c
			   unsigned int limit,
			   unsigned int skbs);
			   unsigned int limit);
int (*sk_fill_write_queue)(struct sock *sk,
```
I think we should document this breaking change, explaining why `sk_prepare_xmit` is deprecated and why the new `sk_fill_write_queue` requires somewhat complicated limit calculations. If not, maybe the author himself will forget about it after a few weeks. It's better to document it in the code as a comment, because tempesta-tech/tempesta#1196 only discusses a theoretical design (which is even outdated after the code rewrite), not the actual implementation.
Yeah, it could make sense to document both callbacks: when they are called and what they do.
Personally, I don't quite understand why we moved framing from `prepare_xmit` to `tcp_push_pending_frames`, which is the biggest change in this PR. In `prepare_xmit` we already know that TCP has enough window to send data, but in `tcp_push_pending_frames` we have to calculate it ourselves, which seems to be a duplication of work. Could you clarify this further in comments?
Yes, I will add a comment before merging the PR. There were several problems with the previous approach:
- We pushed response skbs directly to the socket write queue. If an skb belongs to a stream which is blocked by exceeding the HTTP/2 stream window, we can't send this skb to the client. This skb blocks all other skbs in the socket write queue until a window update arrives.
- We couldn't prioritize streams: when responses are pushed directly to the socket write queue, there is no prioritization. Now we push response skbs to our private scheduler queue. Then we call `tcp_push_pending_frames`, where we run our scheduler to choose the highest-priority stream and send data according to the TCP window. Also, `tcp_push_pending_frames` is called from `tcp_data_snd_check` when we receive data from the peer and the TCP window opens again (see the sketch after this list).
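A simplified sketch of that flow; the callback is passed explicitly here because its real signature is truncated in the diff above, so the wiring is illustrative:

```c
#include <net/tcp.h>

typedef int (*fill_write_queue_t)(struct sock *sk);

/* Sketch: before the regular TCP push, let Tempesta's scheduler move
 * skbs from its private queue into sk_write_queue.
 */
static void tempesta_push(struct sock *sk, fill_write_queue_t fill,
			  unsigned int mss_now, int nonagle)
{
	if (sock_flag(sk, SOCK_TEMPESTA_HAS_DATA)) {
		int r = fill(sk);	/* pick the highest-priority stream,
					 * entail what the TCP window allows */
		if (r < 0)
			return;		/* per the PR: close the socket on error */
		/* r > 0: the scheduler still holds unsent data */
	}
	__tcp_push_pending_frames(sk, mss_now, nonagle);
}
```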
There were several problems with the previous approach:
- We push response skbs directly to the socket write queue. If an skb belongs to a stream which is blocked by exceeding the HTTP/2 stream window, we can't send this skb to the client. This skb blocks all other skbs in the socket write queue until a window update arrives.
- We can't prioritize streams. We push responses to the socket write queue directly, so there is no prioritization.

Because of the framing rework in Tempesta FW, make some changes in the Linux kernel:
- Remove the unused `sk_prepare_xmit` callback.
- Implement the new `sk_fill_write_queue` callback, which is used to move skbs from the Tempesta private write queue to the socket write queue.
- Add the new `SOCK_TEMPESTA_HAS_DATA` socket flag, which is used to indicate that there is some data in the Tempesta FW write queue.

Now we push response skbs to our private scheduler queue. Then we call `tcp_push_pending_frames`, where we run our scheduler to choose the highest-priority stream and send data according to the TCP window. Also, `tcp_push_pending_frames` is called from `tcp_data_snd_check` when we receive data from the peer and the TCP window opens again.
Since we no longer use skb private data in the xmit callbacks, we can use `skb->cb` to save skb private data for our purposes.
- Codestyle fixes; move `tso_fragment` to static scope.
- Close the socket immediately when the new `sk_fill_write_queue` socket callback returns an error.
- `sk_fill_write_queue` returns a positive value if there is unsent data in our scheduler.
Use the minimum of the TCP sender window, the TCP receiver window, and the TLS max payload size to calculate the size of a TLS record, instead of using the MSS for this purpose.
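As a sketch, assuming the `send_win`/`cong_win` values computed earlier in this code and the kernel's `TLS_MAX_PAYLOAD_SIZE` (16 KB):

```c
#include <linux/minmax.h>
#include <net/tls.h>	/* TLS_MAX_PAYLOAD_SIZE = 1 << 14 */

/* TLS record size bounded by both TCP windows and the maximum TLS
 * payload, instead of by the MSS.
 */
static unsigned int tls_record_limit(unsigned int send_win,
				     unsigned int cong_win)
{
	return min3(send_win, cong_win,
		    (unsigned int)TLS_MAX_PAYLOAD_SIZE);
}
```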
Pass the `TCP_NAGLE_OFF | TCP_NAGLE_PUSH` flags when we call `__tcp_push_pending_frames` in kernel code, as we do when we call it from the Tempesta FW source code. Export `tcp_cwnd_restart` to be able to call `tcp_slow_start_after_idle_check` from Tempesta FW code.
We should check `seg_size` in `tcp_write_wakeup` only when it is used as the `skb->len` limit.
In Tempesta FW code we need to implement different behaviour depending on where `sk_fill_write_queue` is called from. (We need to send `tcp_shutdown` if this function is called from `ss_shutdown`, and not send it if it is called from any other place.)
There are a lot of places in our source code which are not clear.